This tutorial of RNA-seq analysis pipeline is all designed based on the HPC environment in CRI. Any changes on the metadata file (i.e.,
example/DLBC/DLBC.metadata.txt) will be considered as not running for practice and should expect the results not be the same as shown here.
Pipeline package is available on GitHub
The Center for Research Informatics (CRI) provides computational resources and expertise in biomedical informatics for researchers in the Biological Sciences Division (BSD) of the University of Chicago.
As a bioinformatics core, we are actively improving our pipelines and expanding pipeline functions. The tutorials will be updated in a timely manner but may not reflect the newest updates of the pipelines. Stay tuned with us for the latest pipeline release.
If you have any questions, comments, or suggestions, feel free to contact our core at bioinformatics@bsd.uchicago.edu or one of our bioinformaticians.
RNA sequencing (RNA-seq) is a revolutionary approach that uses the next-generation sequencing technologies to detect and quantify expressed transcripts in biological samples. Compared to other methods such as microarrays, RNA-seq provides a more unbiased assessment of the full range of transcripts and their isoforms with a greater dynamic range in expression quantification.
In this tutorial, you will learn how to use the CRI’s RNA-seq pipeline (available on both CRI HPC cluster and GitHub)) to analyze Illumina RNA sequencing data. The tutorial comprises the following Steps:
By the end of this tutorial, you will:
This tutorial is based on CRI’s high-performance computing (HPC) cluster. If you are not familiar with this newly assembled cluster, a concise user’s guide can be found here.
The RNA-seq data used in this tutorial are from a published paper that explores PRDM11 and lymphomagenesis.
We will use the data from the PRDM11 knockdown and wild-type samples. You are welcome to explore the full dataset on GEO (GSE56065).
In this tutorial, we use a subset of the data focusing on chr11 in human as the example dataset. The sample information are saved in the file DLBC.metadata.txt (see below).
Work Flow
There are six (partial) single-end RNA-seq sequencing libraries will be used as the example dataset In this tutorial. Their respective sample information is described in the metadata table example/DLBC.metadata.txt.
| Sample | Library | ReadGroup | LibType | Platform | SequencingCenter | Date | Lane | Unit | Flavor | Encoding | Run | Genome | NucleicAcid | Group | Location | Seqfile1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KO01 | KO01 | SRR1205282 | NS | Illumina | SRA | 2015-07-22 | 7 | FCC2B5CACXX | 1x49 | 33 | 0 | grch38 | rnaseq | KO | /group/bioinformatics/CRI_RNAseq_2018/example/data | KO01.test.fastq.gz |
| KO02 | KO02 | SRR1205283 | NS | Illumina | SRA | 2015-07-22 | 7 | FCC2B5CACXX | 1x49 | 33 | 0 | grch38 | rnaseq | KO | /group/bioinformatics/CRI_RNAseq_2018/example/data | KO02.test.fastq.gz |
| KO03 | KO03 | SRR1205284 | NS | Illumina | SRA | 2015-07-22 | 7 | FCC2B5CACXX | 1x49 | 33 | 0 | grch38 | rnaseq | KO | /group/bioinformatics/CRI_RNAseq_2018/example/data | KO03.test.fastq.gz |
| WT01 | WT01 | SRR1205285 | NS | Illumina | SRA | 2015-07-22 | 4 | FCC2C3CACXX | 1x49 | 33 | 0 | grch38 | rnaseq | WT | /group/bioinformatics/CRI_RNAseq_2018/example/data | WT01.test.fastq.gz |
| WT02 | WT02 | SRR1205286 | NS | Illumina | SRA | 2015-07-22 | 4 | FCC2C3CACXX | 1x49 | 33 | 0 | grch38 | rnaseq | WT | /group/bioinformatics/CRI_RNAseq_2018/example/data | WT02.test.fastq.gz |
| WT03 | WT03 | SRR1205287 | NS | Illumina | SRA | 2015-07-22 | 4 | FCC2C3CACXX | 1x49 | 33 | 0 | grch38 | rnaseq | WT | /group/bioinformatics/CRI_RNAseq_2018/example/data | WT03.test.fastq.gz |
Column LibType can be “NS” for unstranded, “RF” for first strand, or “FR” for second strand.
You can read this blog for more details of strand-specific RNA-seq.
For the persepctive of running as practice, here we use a subset of the human genome
GRCh38.primary_Gencode24_50bp_chr11for Steps 1~4, and the complete information after Step 5
We will use SSH (Secure Shell) to connect to CRI’s HPC. SSH now is included or can be installed in all standard operating systems (Windows, Linux, and OS X).
The login procedure varies slightly depending on whether you use a Mac/Unix/Linux computer or a Windows computer.
Connect to the login node of the CRI HPC cluster:
$ ssh -l username@gardner.cri.uchicago.eduType in the password when prompted
Make sure that you replace
usernamewith your login name.
Now you should be in your home directory after logging in
$ pwd
/home/usernameCreate a new directory (e.g., cri_rnaseq) under the home directory
$ mkdir cri_rnaseq
$ cd cri_rnaseqOne way to download the pipeline package via git clone
$ git clone git@//github.com/wenching/cri_rnaseq_2018.gitOr, you can use wget to download the pipeline package
$ wget https://github.com/wenching/cri_rnaseq_2018/archive/master.zip
$ unzip master.zip
$ mv cri_rnaseq_2018-master cri_rnaseq_2018Change working directory to pipeline dirctory
$ cd cri_rnaseq_2018
$ tree -d
|-- SRC
| |-- Python
| | |-- lib
| | |-- module
| | `-- util
| `-- R
| |-- module
| `-- util
|-- example
| |-- DLBC_full
| | `-- RNAseq
| | |-- DEG
| | | |-- deseq2
| | | | `-- featurecounts
| | | | `-- star
| | | |-- edger
| | | | `-- featurecounts
| | | | `-- star
| | | `-- limma
| | | `-- featurecounts
| | | `-- star
| `-- data
`-- references
`-- GRCh38.primary_Gencode24_50bp_chr11Raw sequencing data files (*.fastq.gz) are located at example/data/
$ tree example/data/
|-- KO01.test.fastq.gz
|-- KO02.test.fastq.gz
|-- KO03.test.fastq.gz
|-- WT01.test.fastq.gz
|-- WT02.test.fastq.gz
|-- WT03.test.fastq.gzGenome data are located at /group/bioinformatics/Workshops/cri2016/reference/rnaseq/GRCh38.primary_Gencode24_50bp_chr11
$ tree /group/bioinformatics/Workshops/cri2016/reference/rnaseq/GRCh38.primary_Gencode24_50bp_chr11
|-- GRCh38.primary_assembly.genome.chr11.chrom.size
|-- GRCh38.primary_assembly.genome.chr11.dict
|-- GRCh38.primary_assembly.genome.chr11.fa
|-- GRCh38.primary_assembly.genome.chr11.fa.fai
|-- gencode.v24.primary_assembly.annotation.chr11.bed12
|-- gencode.v24.primary_assembly.annotation.chr11.gtf
|-- gencode.v24.primary_assembly.annotation.chr11.gtf.geneinfo
|-- gencode.v24.primary_assembly.annotation.chr11.gtf.transcriptinfo
|-- gencode.v24.primary_assembly.annotation.chr11.rRNA.bed
|-- gencode.v24.primary_assembly.annotation.chr11.rRNA.interval_list
|-- gencode.v24.primary_assembly.annotation.chr11.refFlat.txtproject related files (i.e., metadata & configuration file) as used in this tutorial are located under example/
$ ls -l example/DLBC.*
|-- DLBC.metadata.txt
|-- DLBC.pipeline.yaml
Here are the first few lines in the configuration example file example/DLBC.pipeline.yaml
---
pipeline:
flags:
aligners:
run_star: True
quantifiers:
run_featurecounts: True
run_rsem: False
run_kallisto: False
callers:
run_edger: True
run_deseq2: True
run_limma: True
software:
main:
use_module: 0
adapter_pe: AGATCGGAAGAGCGGTTCAG,AGATCGGAAGAGCGTCGTGT
adapter_se: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
fastq_format: 33
genome_assembly: grch38When running on another dataset, you will need to modify these
two filesand the master pipeline script (i.e.,Build_RNAseq.DLBC.sh) (as described below) accordingly.For instance, if you would like to turn off the DE analysis tool limma, you can set the respecitve paramter to ‘False’
- run_limma: False
Master pipeline script
$ cat Build_RNAseq.DLBC.sh
## build pipeline scripts
now=$(date +"%m-%d-%Y_%H:%M:%S")
## project info
project="DLBC"
SubmitRNAseqExe="Submit_RNAseq.$project.sh"
padding="example/"
## command
echo "START" `date` " Running build_rnaseq.py"
python3 SRC/Python/build_rnaseq.py \
--projdir $PWD/${padding}$project \
--metadata $PWD/${padding}$project.metadata.txt \
--config $PWD/${padding}$project.pipeline.yaml \
--systype cluster \
--threads 4 \
--log_file $PWD/Build_RNAseq.$project.$now.log
## submit pipeline master script
echo "START" `date` " Running ${padding}$project/Submit_RNAseq.$project.sh"
echo "bash ${padding}$project/Submit_RNAseq.$project.sh"
echo "END" `date`
Basically, when running on your own dataset, you will need to modify this master pipeline script (i.e.,
Build_RNAseq.DLBC.sh) accordingly.For instance, you can change respective parameters as follows.
- project=“MY_OWN_DATA_SET”
- padding=“” # In this case, the project pipeline scripts will be generated and saved under the same directory of the master pipeline script, instead of being under a named sub-directory (e.g., example/ in this tutorial).
The following commands will generate all necessary sub-task scripts in the pipeline
# load modules
$ module purge;module load gcc udunits python/3.6.0 R; module update
# this step is optional but it will install all necessary R packages ahead.
$ Rscript --vanilla /group/bioinformatics/cri_rnaseq_2018/SRC/R/util/prerequisite.packages.R
# create directories and generate all necessary scripts
$ bash Build_RNAseq.DLBC.sh
# run the entire pipeline with just this command
$ bash example/DLBC/Submit_RNAseq.DLBC.shBefore running, you need to know
The master BigDataScript script can be
ONLY run on a head/entry nodeother than any other compuatation node.
But, you can step by step run individual sub-task bash scripts in any computation node interactively.
$ module purge; module load gcc python/3.6.0; module update
$ bash Build_RNAseq.DLBC.shthis step will execute SRC/Python/build_rnaseq.py using python3 to generate all sub-task bash scripts and directories according to the provided metadata and configuration files (i.e., example/DLBC.metadata.txt and example/DLBC.pipeline.yaml )
$ tree example/DLBC
example/DLBC
`-- RNAseq
|-- Aln
| `-- star
| |-- KO01
| | `-- SRR1205282
| |-- KO02
| | `-- SRR1205283
| |-- KO03
| | `-- SRR1205284
| |-- WT01
| | `-- SRR1205285
| |-- WT02
| | `-- SRR1205286
| `-- WT03
| `-- SRR1205287
|-- AlnQC
| |-- picard
| | `-- star
| | |-- KO01
| | | `-- tmp
| | |-- KO02
| | | `-- tmp
| | |-- KO03
| | | `-- tmp
| | |-- WT01
| | | `-- tmp
| | |-- WT02
| | | `-- tmp
| | `-- WT03
| | `-- tmp
| `-- rseqc
| `-- star
| |-- KO01
| | `-- tmp
| |-- KO02
| | `-- tmp
| |-- KO03
| | `-- tmp
| |-- WT01
| | `-- tmp
| |-- WT02
| | `-- tmp
| `-- WT03
| `-- tmp
|-- DEG
| |-- deseq2
| | `-- featurecounts
| | `-- star
| |-- edger
| | `-- featurecounts
| | `-- star
| `-- limma
| `-- featurecounts
| `-- star
|-- GSEA
| `-- featurecounts
| `-- star
| `-- DLBC
|-- LociStat
| `-- featurecounts
| `-- star
| `-- DLBC
|-- QuantQC
| `-- featurecounts
| `-- star
|-- Quantification
| `-- featurecounts
| `-- star
| |-- KO01
| |-- KO02
| |-- KO03
| |-- WT01
| |-- WT02
| `-- WT03
|-- RawReadQC
| |-- KO01
| | `-- SRR1205282
| |-- KO02
| | `-- SRR1205283
| |-- KO03
| | `-- SRR1205284
| |-- WT01
| | `-- SRR1205285
| |-- WT02
| | `-- SRR1205286
| `-- WT03
| `-- SRR1205287
`-- shell_scriptsExecute the entire analysis by just one command
# This will start to run the entire pipeline.
# You can chekc teh BDS report to know the running status.
$ bash example/DLBC/Submit_RNAseq.DLBC.sh
Submit_RNAseq.DLBC.bds, so you will need to run this command on a head/entry node.For the first step, the pipeline will perform quality assessment on the raw fastq files.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.RawReadQC.FastQC.SRR1205282.sh example/DLBC/Submit_RNAseq.DLBC.bds
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/RawReadQC/KO01/SRR1205282/KO01.test_fastqc.zip' ] <- [ '/group/bioinformatics/CRI_RNAseq_2018/example/data/KO01.test.fastq.gz' ], cpus := 1, mem := 16*G, timeout := 72*hour, taskName := "FastQC.SRR1205282") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.RawReadQC.FastQC.SRR1205282.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/RawReadQC/KO01/SRR1205282/KO01.test_fastqc.zip' ] )
This code chunk will invoke the bash script example/DLBC/RNAseq/shell_scripts/run.RawReadQC.FastQC.SRR1205282.sh to execute FastQC on the KO01(SRR1205282) sequencing library.
After the completion of the entire pipeline, you can check FastQC report per individual libraries; for instance, the partial report of KO01 will be as follows or a full report.
You can check FastQC Help for more details about how to interpret a FastQC report.
Or, compare your reports to the example reports provided by FastQC for a Good Illumina Data or Bad Illumina Data.
In this step, the pipeline will conduct read alignment on the raw fastq files.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.alignRead.star.SRR1205282.sh example/DLBC/Submit_RNAseq.DLBC.bds
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Aln/star/KO01/SRR1205282/SRR1205282.star.bam' ] <- [ '/group/bioinformatics/CRI_RNAseq_2018/example/data/KO01.test.fastq.gz' ], cpus := 4, mem := 64*G, timeout := 72*hour, taskName := "star.SRR1205282") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.alignRead.star.SRR1205282.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Aln/star/KO01/SRR1205282/SRR1205282.star.bam' ] )
This code chunk will invoke the bash script example/DLBC/RNAseq/shell_scripts/run.alignRead.star.SRR1205282.sh to execute STAR on the KO01(SRR1205282) sequencing library.
After the completion of the entire pipeline, you can check the alignment result of each individual libraries; for instance, the result of KO01(SRR1205282) will be as follows.
$ tree example/DLBC/RNAseq/Aln/star/KO01/SRR1205282
example/DLBC/RNAseq/Aln/star/KO01/SRR1205282
|-- SRR1205282.star.Aligned.sortedByCoord.out.bam
|-- SRR1205282.star.Log.final.out
|-- SRR1205282.star.Log.out
|-- SRR1205282.star.Log.progress.out
|-- SRR1205282.star.SJ.out.tab
|-- SRR1205282.star.Unmapped.out.mate1
|-- SRR1205282.star.bai
|-- SRR1205282.star.bam -> SRR1205282.star.Aligned.sortedByCoord.out.bam
`-- run.alignRead.star.SRR1205282.log
You can check a log file (e.g., example/DLBC/RNAseq/Aln/star/KO01/SRR1205282/SRR1205282.star.Log.final.out) for more alignment information provided by STAR.
$ cat example/DLBC/RNAseq/Aln/star/KO01/SRR1205282/SRR1205282.star.Log.final.out
Started job on | Jul 25 13:38:32
Started mapping on | Jul 25 13:38:59
Finished on | Jul 25 13:39:03
Mapping speed, Million of reads per hour | 215.88
Number of input reads | 239866
Average input read length | 49
UNIQUE READS:
Uniquely mapped reads number | 238305
Uniquely mapped reads % | 99.35%
Average mapped length | 48.84
Number of splices: Total | 43900
Number of splices: Annotated (sjdb) | 43566
Number of splices: GT/AG | 43729
Number of splices: GC/AG | 171
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 0
Mismatch rate per base, % | 0.22%
Deletion rate per base | 0.01%
Deletion average length | 1.98
Insertion rate per base | 0.00%
Insertion average length | 1.37
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 0
% of reads mapped to multiple loci | 0.00%
Number of reads mapped to too many loci | 1552
% of reads mapped to too many loci | 0.65%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 0.00%
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
In this step, the pipeline will conduct a QC on alignment result.
The BDS code snippets for the sample KO01 will look like:
$ grep -A1 run.alnQC.*.star.KO01.*.sh example/DLBC/Submit_RNAseq.DLBC.bds
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics.pdf' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "picard.star.KO01") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.alnQC.picard.star.KO01.CollectRnaSeqMetrics.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics.pdf' ] )
--
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.xls', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.r', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.pdf' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 1, mem := 8*G, timeout := 72*hour, taskName := "rseqc.star.KO01") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.alnQC.rseqc.star.KO01.clipping_profile.py.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.xls', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.r', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.pdf' ] )
--
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.infer_experiment.txt' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 1, mem := 8*G, timeout := 72*hour, taskName := "rseqc.star.KO01") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.alnQC.rseqc.star.KO01.infer_experiment.py.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.infer_experiment.txt' ] )
--
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.eRPKM.xls', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.rawCount.xls', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.saturation.r' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 1, mem := 8*G, timeout := 72*hour, taskName := "rseqc.star.KO01") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.alnQC.rseqc.star.KO01.RPKM_saturation.py.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.eRPKM.xls', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.rawCount.xls', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.saturation.r' ] )
This code chunk will invoke few bash scripts (e.g., example/DLBC/RNAseq/shell_scripts/run.alnQC.picard.star.KO01.CollectRnaSeqMetrics.sh, run.alnQC.rseqc.star.KO01.clipping_profile.py.sh, run.alnQC.rseqc.star.KO01.infer_experiment.py.sh, and run.alnQC.rseqc.star.KO01.RPKM_saturation.py.sh) to execute alignment QC tools (i.e., Picard and RSeQC) on the sample KO01.
After the completion of the entire pipeline, you can check the alignment QC results of each individual samples; for instance, the results of KO01 will be as follows.
$ ls example/DLBC/RNAseq/AlnQC/*/star/KO01
example/DLBC/RNAseq/AlnQC/picard/star/KO01:
KO01.star.picard.RNA_Metrics KO01.star.picard.RNA_Metrics.pdf run.alnQC.picard.star.KO01.CollectRnaSeqMetrics.log tmp
example/DLBC/RNAseq/AlnQC/rseqc/star/KO01:
KO01.star.pdfseqc.saturation.pdf KO01.star.rseqc.eRPKM.xls run.alnQC.rseqc.star.KO01.RPKM_saturation.py.log
KO01.star.rseqc.clipping_profile.pdf KO01.star.rseqc.infer_experiment.txt run.alnQC.rseqc.star.KO01.clipping_profile.py.log
KO01.star.rseqc.clipping_profile.r KO01.star.rseqc.rawCount.xls tmp
KO01.star.rseqc.clipping_profile.xls KO01.star.rseqc.saturation.r
You can check alignment statistics (e.g., example/DLBC/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics) for more information provided by Picard.
$ head example/DLBC/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics
## htsjdk.samtools.metrics.StringHeader
# picard.analysis.CollectRnaSeqMetrics REF_FLAT=/group/bioinformatics/CRI_RNAseq_2018/example/reference/GRCh38.primary_Gencode24_50bp_chr11/gencode.v24.primary_assembly.annotation.chr11.refFlat.txt RIBOSOMAL_INTERVALS=/group/bioinformatics/CRI_RNAseq_2018/example/reference/GRCh38.primary_Gencode24_50bp_chr11/gencode.v24.primary_assembly.annotation.chr11.rRNA.interval_list STRAND_SPECIFICITY=NONE CHART_OUTPUT=/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics.pdf METRIC_ACCUMULATION_LEVEL=[SAMPLE, ALL_READS] INPUT=/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Aln/star/KO01/KO01.star.bam OUTPUT=/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics TMP_DIR=[/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/AlnQC/picard/star/KO01/tmp] MINIMUM_LENGTH=500 RRNA_FRAGMENT_PERCENTAGE=0.8 ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
## htsjdk.samtools.metrics.StringHeader
# Started on: Wed Jul 25 13:41:44 CDT 2018
## METRICS CLASS picard.analysis.RnaSeqMetrics
PF_BASES PF_ALIGNED_BASES RIBOSOMAL_BASES CODING_BASES UTR_BASES INTRONIC_BASES INTERGENIC_BASES IGNORED_READS CORRECT_STRAND_READS INCORRECT_STRAND_READS PCT_RIBOSOMAL_BASES PCT_CODING_BASES PCT_UTR_BASES PCT_INTRONIC_BASES PCT_INTERGENIC_BASES PCT_MRNA_BASES PCT_USABLE_BASES PCT_CORRECT_STRAND_READS MEDIAN_CV_COVERAGE MEDIAN_5PRIME_BIAS MEDIAN_3PRIME_BIAS MEDIAN_5PRIME_TO_3PRIME_BIAS SAMPLE LIBRARY READ_GROUP
11676945 11638066 0 6586023 4106888 849952 95203 0 0 0 0 0.565904 0.352884 0.073032 0.008180.918788 0.915728 0 0.522396 0.602183 0.758017 0.701572
11676945 11638066 0 6586023 4106888 849952 95203 0 0 0 0 0.565904 0.352884 0.073032 0.008180.918788 0.915728 0 0.522396 0.602183 0.758017 0.701572 unknown
Or, the respective coverage plot of the sample KO01 produced by Picard will be as follows.
There are three alignment measurements performed using RSeQC.
The results will be as follows. Please check the RSeQC website for more measurements and details.
$cat example/DLBC/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.infer_experiment.txt
##
##
## This is SingleEnd Data
## Fraction of reads failed to determine: 0.0025
## Fraction of reads explained by "++,--": 0.6025
## Fraction of reads explained by "+-,-+": 0.3950
In this step, the pipeline will conduct expression quantification over alignments.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.quant.featurecounts.star.KO01.sh example/DLBC/Submit_RNAseq.DLBC.bds
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "featurecounts.star.KO01") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.quant.featurecounts.star.KO01.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count' ] )
This code chunk will invoke the bash script (e.g., example/DLBC/RNAseq/shell_scripts/run.quant.featurecounts.star.KO01.sh) to execute expression quantification tool (i.e., Subread::featureCounts on the sample KO01.
After the completion of the entire pipeline, you can check the quantification results of each individual samples; for instance, the results of KO01 will be as follows.
$ tree example/DLBC/RNAseq/Quantification/featurecounts/star/KO01
example/DLBC/RNAseq/Quantification/featurecounts/star/KO01
|-- KO01.star.featurecounts.count
|-- KO01.star.featurecounts.count.jcounts
|-- KO01.star.featurecounts.count.summary
`-- run.quant.featurecounts.star.KO01.log
You can check quantification statistics (e.g., example/DLBC/RNAseq/Quantification/featurecounts/star/KO01KO01.star.featurecounts.count.summary) for more information provided by Subread::featureCounts
$ cat example/DLBC/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count.summary
Status /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Aln/star/KO01/KO01.star.bam
Assigned 213869
Unassigned_Unmapped 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_NoFeatures 17539
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 6897
Or, the top 10 most abundant genes in the sample KO01 (on chr11) will be as follows.
$ cat <(head -n2 example/DLBC/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count | tail -n+2 | cut -f1,7) <(cut -f1,7 example/DLBC/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count | sort -k2,2nr | head)
| Geneid | Chr | Start | End | Strand | Length | KO01 |
|---|---|---|---|---|---|---|
| ENSG00000175216.14 | chr11 | 46743048 | 46846229 |
|
8538 | 25908 |
| ENSG00000149187.17 | chr11 | 47465944 | 47565520 |
|
10441 | 17065 |
| ENSG00000166181.12 | chr11 | 43311963 | 43342676 |
|
4651 | 13992 |
| ENSG00000149177.12 | chr11 | 47980558 | 48170841 |
|
9620 | 13503 |
| ENSG00000244313.3 | chr11 | 46428653 | 46429150 |
|
498 | 10870 |
| ENSG00000030066.13 | chr11 | 47778087 | 47848467 |
|
8576 | 10493 |
| ENSG00000165916.8 | chr11 | 47418769 | 47426266 |
|
2276 | 9331 |
| ENSG00000149084.11 | chr11 | 43556436 | 43856610 |
|
8600 | 9247 |
| ENSG00000109919.9 | chr11 | 47617315 | 47642595 |
|
2897 | 8005 |
| ENSG00000109920.12 | chr11 | 47716517 | 47767301 |
|
7596 | 7883 |
In this step, the pipeline will identify differentially expressed genes (DEG) according to the alignment result files (i.e., BAM files) after the alignment step.
The BDS code snippets for the example dataset will look like:
$ grep -A1 run.call.*.featurecounts.star.DLBC.sh example/DLBC/Submit_RNAseq.DLBC.bds
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/DEG/edger/featurecounts/star/DLBC.star.featurecounts.edger.count.txt' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/KO02/KO02.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/KO03/KO03.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/WT01/WT01.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/WT02/WT02.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/WT03/WT03.star.featurecounts.count' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "edger.featurecounts.star.DLBC") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.call.edger.featurecounts.star.DLBC.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/DEG/edger/featurecounts/star/DLBC.star.featurecounts.edger.count.txt' ] )
--
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/DEG/deseq2/featurecounts/star/DLBC.star.featurecounts.deseq2.count.txt' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/KO02/KO02.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/KO03/KO03.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/WT01/WT01.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/WT02/WT02.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/WT03/WT03.star.featurecounts.count' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "deseq2.featurecounts.star.DLBC") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.call.deseq2.featurecounts.star.DLBC.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/DEG/deseq2/featurecounts/star/DLBC.star.featurecounts.deseq2.count.txt' ] )
--
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/DEG/limma/featurecounts/star/DLBC.star.featurecounts.limma.count.txt' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/KO02/KO02.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/KO03/KO03.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/WT01/WT01.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/WT02/WT02.star.featurecounts.count', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/Quantification/featurecounts/star/WT03/WT03.star.featurecounts.count' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "limma.featurecounts.star.DLBC") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.call.limma.featurecounts.star.DLBC.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/DEG/limma/featurecounts/star/DLBC.star.featurecounts.limma.count.txt' ] )
This code chunk will invoke few bash scripts (e.g., example/DLBC/RNAseq/shell_scripts/run.call.edger.featurecounts.star.DLBC.sh, run.call.deseq2.featurecounts.star.DLBC.sh, and run.call.limma.featurecounts.star.DLBC.sh) to execute differential expression (DE) analysis using three the state-of-the-art tools (i.e., edgeR, DESeq2, and limma) on the example dataset of six samples from KO01 to WT03.
There are three DE analysis tools used in the current pipeline, including
After the completion of the entire pipeline, you can check the calling results of each individual methods; for instance, the analysis results of the example dataset will be as follows.
$ ls example/DLBC/RNAseq/DEG/*/featurecounts/star/
example/DLBC/RNAseq/DEG/deseq2/featurecounts/star/:
DLBC.star.featurecounts.deseq2.RData
DLBC.star.featurecounts.deseq2.count.ntd.meanSdPlot.pdf
DLBC.star.featurecounts.deseq2.count.ntd.txt
DLBC.star.featurecounts.deseq2.count.rld.meanSdPlot.pdf
DLBC.star.featurecounts.deseq2.count.rld.txt
DLBC.star.featurecounts.deseq2.count.txt
DLBC.star.featurecounts.deseq2.count.vst.meanSdPlot.pdf
DLBC.star.featurecounts.deseq2.count.vst.txt
DLBC.star.featurecounts.deseq2.plotDispEsts.pdf
DLBC.star.featurecounts.deseq2.plotMA.pdf
DLBC.star.featurecounts.deseq2.test.DEG.txt
DLBC.star.featurecounts.deseq2.test.txt
run.call.deseq2.featurecounts.star.DLBC.log
example/DLBC/RNAseq/DEG/edger/featurecounts/star/:
DLBC.star.featurecounts.edger.RData DLBC.star.featurecounts.edger.plotSmear.pdf
DLBC.star.featurecounts.edger.count.txt DLBC.star.featurecounts.edger.test.DEG.txt
DLBC.star.featurecounts.edger.plotBCV.pdf DLBC.star.featurecounts.edger.test.txt
DLBC.star.featurecounts.edger.plotMA.pdf run.call.edger.featurecounts.star.DLBC.log
example/DLBC/RNAseq/DEG/limma/featurecounts/star/:
DLBC.star.featurecounts.limma.RData
DLBC.star.featurecounts.limma.count.voom.meanSdPlot.pdf
DLBC.star.featurecounts.limma.count.txt
DLBC.star.featurecounts.limma.plotMA.pdf
DLBC.star.featurecounts.limma.test.DEG.txt
DLBC.star.featurecounts.limma.test.txt
DLBC.star.featurecounts.limma.voom.mean-variance.pdf
run.call.limma.featurecounts.star.DLBC.log
You can check statistical test results per gene (e.g., example/DLBC/RNAseq/DEG/deseq2/featurecounts/star/DLBC.star.featurecounts.deseq2.test.txt) for more information generated by each method.
$ head example/DLBC/RNAseq/DEG/deseq2/featurecounts/star/DLBC.star.featurecounts.deseq2.test.txt
| Geneid | baseMean | log2FoldChange | lfcSE | stat | pvalue | FDR | DEG | |
|---|---|---|---|---|---|---|---|---|
| 4249 | ENSG00000120738.7 | 5193.2844 | 1.960273 | 0.1147988 | 17.07573 | 0 | 0 | 1 |
| 9872 | ENSG00000170345.9 | 422.7627 | 1.960006 | 0.1246756 | 15.72085 | 0 | 0 | 1 |
| 4273 | ENSG00000113070.7 | 1441.0826 | 1.386086 | 0.1041939 | 13.30296 | 0 | 0 | 1 |
| 9601 | ENSG00000100867.14 | 356.3080 | 1.707939 | 0.1339178 | 12.75364 | 0 | 0 | 1 |
| 8830 | ENSG00000229117.8 | 7121.3363 | -1.270634 | 0.1061629 | -11.96871 | 0 | 0 | -1 |
| 7342 | ENSG00000035403.17 | 855.0647 | -1.158327 | 0.1024376 | -11.30764 | 0 | 0 | -1 |
| 587 | ENSG00000177606.6 | 556.6779 | 1.267243 | 0.1122316 | 11.29133 | 0 | 0 | 1 |
| 4407 | ENSG00000038274.16 | 3176.8050 | -1.277175 | 0.1135558 | -11.24711 | 0 | 0 | -1 |
| 7462 | ENSG00000099194.5 | 27347.1594 | -1.157671 | 0.1036467 | -11.16940 | 0 | 0 | -1 |
| 734 | ENSG00000134215.15 | 2470.3706 | 1.200234 | 0.1078677 | 11.12691 | 0 | 0 | 1 |
Or, the exploratory plots of the example dataset produced by DESeq2 will be as follows.
To demonstrate the full power of the following analyses, the pipeline will use the pre-run count tables and DEG lists from the full sequencing datasets.
Please make sure that there is no any changes on the metadata file (i.e.,example/DLBC/DLBC.metadata.txt).
Otherwise, it will be considered as not running for practice and should expect the following results not be the same as shown here.
In this step, the pipeline will collect DEG statistics and identify the overlapping set of identified DEGs from the previous methods.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.lociStat.featurecounts.star.DLBC.sh example/DLBC/Submit_RNAseq.DLBC.bds
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/LociStat/featurecounts/star/DLBC/DLBC.star.featurecounts.overlap.txt' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC_full/RNAseq/DEG/edger/featurecounts/star/DLBC.star.featurecounts.edger.test.DEG.txt', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC_full/RNAseq/DEG/deseq2/featurecounts/star/DLBC.star.featurecounts.deseq2.test.DEG.txt', '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC_full/RNAseq/DEG/limma/featurecounts/star/DLBC.star.featurecounts.limma.test.DEG.txt' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "lociStat.featurecounts.star.DLBC") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.lociStat.featurecounts.star.DLBC.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/LociStat/featurecounts/star/DLBC/DLBC.star.featurecounts.overlap.txt' ] )
This code chunk will invoke the bash script (e.g., example/DLBC/RNAseq/shell_scripts/run.lociStat.featurecounts.star.DLBC.sh) to collect DEG statistics and to make a Venn diagram plot.
After the completion of the entire pipeline, you can check the statistics result of DEGs per method; for instance, the example dataset DLBC will be as follows.
$ grep -A5 'Up/Down regulated DEGs per methods' example/DLBC/RNAseq/LociStat/featurecounts/star/DLBC/run.lociStat.featurecounts.star.DLBC.log | tail -n+2
INFO [2018-07-25 15:18:08] #STAT: Up/Down regulated DEGs per methods
edger deseq2 limma
-1 484 409 408
0 0 0 0
1 387 362 395
Sum 871 771 803
$ cut -f1,2,4 example/DLBC/RNAseq/LociStat/featurecounts/star/DLBC/DLBC.star.featurecounts.VennList.txt
| Methods | Method.Num | ID.Num |
|---|---|---|
| edger | 1 | 40 |
| deseq2 | 1 | 14 |
| limma | 1 | 45 |
| edger&deseq2 | 2 | 85 |
| edger&limma | 2 | 86 |
| deseq2&limma | 2 | 12 |
| edger&deseq2&limma | 3 | 660 |
There is a Venn diagram plot will be generated after this step.
In this step, the pipeline will make a PCA plot based on the transcriptional profiling of all samples.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.quantQC.pca.featurecounts.star.DLBC.sh example/DLBC/Submit_RNAseq.DLBC.bds
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/QuantQC/featurecounts/star/DLBC.star.featurecounts.pca.pdf' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC_full/RNAseq/DEG/deseq2/featurecounts/star/DLBC.star.featurecounts.deseq2.count.txt' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "pca.featurecounts.star.DLBC") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.quantQC.pca.featurecounts.star.DLBC.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/QuantQC/featurecounts/star/DLBC.star.featurecounts.pca.pdf' ] )
This code chunk will invoke the bash script (e.g., example/DLBC/RNAseq/shell_scripts/run.quantQC.pca.featurecounts.star.DLBC.sh) to make a PCA plot based on the alignment quantification result generated by DESeq2 or one of DE analysis tools.
After the completion of the entire pipeline, you can check the PCA plot under the folder of QuantQC/.
In this step, the pipeline will make a heat map based on the overlapping set of DEGs identified across different DE analysis tools.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.postAna.pheatmap.featurecounts.star.DLBC.sh example/DLBC/Submit_RNAseq.DLBC.bds
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/PostAna/pheatmap/featurecounts/star/DLBC/DLBC.star.featurecounts.heatmap.pdf' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/LociStat/featurecounts/star/DLBC/DLBC.star.featurecounts.overlap.txt' ], cpus := 1, mem := 8*G, timeout := 72*hour, taskName := "postAna.pheatmap.featurecounts.star.DLBC") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.postAna.pheatmap.featurecounts.star.DLBC.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/PostAna/pheatmap/featurecounts/star/DLBC/DLBC.star.featurecounts.heatmap.pdf' ] )
This code chunk will invoke the bash script (e.g., example/DLBC/RNAseq/shell_scripts/run.postAna.pheatmap.featurecounts.star.DLBC.sh) to make a heat map plot based on the overlapping set of DEGs identified across different DE analysis tools (i.e., edgeR, DESeq2, and limma).
After the completion of the entire pipeline, you can check the heat map under the folder of PostAna/pheatmap.
In this step, the pipeline will conduct enrichment analysis and make several exploratory plots based on the overlapping set of DEGs identified across different DE analysis tools.
The BDS code snippet for the sample KO01 will look like:
$ grep -A1 run.postAna.clusterprofiler.featurecounts.star.DLBC.sh example/DLBC/Submit_RNAseq.DLBC.bds
dep( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/PostAna/clusterprofiler/featurecounts/star/DLBC/DLBC.star.featurecounts.enrichGO.ALL.txt' ] <- [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/LociStat/featurecounts/star/DLBC/DLBC.star.featurecounts.overlap.txt' ], cpus := 1, mem := 8*G, timeout := 72*hour, taskName := "postAna.clusterprofiler.featurecounts.star.DLBC") sys bash /home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/shell_scripts/run.postAna.clusterprofiler.featurecounts.star.DLBC.sh; sleep 2
goal( [ '/home/USER/cri_rnaseq/cri_rnaseq_2018/example/DLBC/RNAseq/PostAna/clusterprofiler/featurecounts/star/DLBC/DLBC.star.featurecounts.enrichGO.ALL.txt' ] )
This code chunk will invoke the bash script (e.g., example/DLBC/RNAseq/shell_scripts/run.postAna.clusterprofiler.featurecounts.star.DLBC.shh) to conduct enrichment analyses including GO and KEGG pathway erichment analyses as well as gene set enrichment analysis (GSEA).
After the completion of the entire pipeline, you can check the heat map under the folder of PostAna/pheatmap.
$ ls example/DLBC/RNAseq/PostAna/clusterprofiler/featurecounts/star/DLBC/
DLBC.star.featurecounts.enrichGO.ALL.cnetplot.pdf DLBC.star.featurecounts.enrichGSEAGO.ALL.neg001.pdf DLBC.star.featurecounts.enrichKEGG.cnetplot.pdf
DLBC.star.featurecounts.enrichGO.ALL.dotplot.pdf DLBC.star.featurecounts.enrichGSEAGO.ALL.pos001.pdf DLBC.star.featurecounts.enrichKEGG.dotplot.pdf
DLBC.star.featurecounts.enrichGO.ALL.emapplot.pdf DLBC.star.featurecounts.enrichGSEAGO.ALL.txt DLBC.star.featurecounts.enrichKEGG.emapplot.pdf
DLBC.star.featurecounts.enrichGO.ALL.txt DLBC.star.featurecounts.enrichGSEAKEGG.pos001.pdf DLBC.star.featurecounts.enrichKEGG.txt run.GSEA.featurecounts.star.DLBC.log
Below, several plots will be generated based on the overlapping set of DEGs using GO database as an example.
Considering the environment setting in the CRI HPC system, BigDataScript was used as a job management system in the current development to achieve an automatic pipeline. It can handle the execution dependency of all sub-task bash scripts and resume from a failed point, if any.
After the completion of the entire pipeline, you will see a BigDataScript report in HTML under the pipeline folder. For instance, this is the report from one test run. The graphic timeline will tell you the execution time per sub-task script.
TBC